Abstract

Background: In the past years, significant progress has been made to develop and use experimental settings for 
extensive data collection on tobacco smoke exposure and tobacco smoke exposure-associated diseases. Due to the 
growing number of such data, there is a need for domain-specific standard ontologies to facilitate the integration 
of tobacco exposure data. 

Results: The CSEO (version 1.0) is composed of 20091 concepts. The ontology in its current form is able to capture 
a wide range of cigarette smoke exposure concepts within the knowledge domain of exposure science with a 
reasonable sensitivity and specificity. Moreover, it showed a promising performance when used to answer domain 
expert questions. The CSEO complies with standard upper-level ontologies and is freely accessible to the scientific 
community through a dedicated wiki at https://publicwiki-01.fraunhofer.de/CSEO-Wiki/index.php/Main_Page. 

Conclusions: The CSEO has the potential to become a widely used standard within the academic and industrial
community. Given the emerging need of systems toxicology for controlled vocabularies and the
lack of suitable ontologies for this domain, the CSEO prepares the ground for integrative systems-based research in
exposure science.
Background

Recently, there has been an increased focus in systems
toxicology on systems-oriented methodologies that
emphasize the understanding of the biological impact of
chemical exposures with increased mechanistic granularity
[1,2]. In particular, a recent report by the US National
Research Council Committee on Toxicity Testing and
Assessment of Environmental Agents advocates a shift
away from toxicological assessment at the level of apical
endpoints towards the understanding of the effects of an
exposure on toxicity pathways [3]. Moreover, the Food
and Drug Administration (FDA) recently described a
systems-based omics approach to discover pulmonary
biomarkers and to improve the evaluation of tobacco
products [4]. This indicates a growing recognition that
exposure science should be considered an integrated
part of a systematic approach to risk assessment [5].
To assess biological responses to environmental
exposure, a systems-based approach attempts to apply an 
integrative strategy. A systems-based approach integrates 
a continuous model from the starting point of exposure to 
disease outcome [6]. A typical limitation in systems 
approaches is the lack of standards for harmonization of 
heterogeneous data types that are experimentally obtained 
from different resources. Such data types often have 
various structures, formats and annotations, which 
adversely affect the degrees of their interoperability and 
flexibility for integrative methods. Standard terminologies 
and proper contextual information are necessary for data 
sharing, reuse, and integration [7]. Recently, biomedical 
ontologies have emerged in support of systems approaches 
by facilitating the annotation of bio-simulation models 
and flexible access to knowledge [8]. The main purpose of 
ontologies is to organize data and information of a 
particular knowledge domain in a structured, controlled, 
and standard manner. Thus the data can be shared among 
scientists in different research areas or accessed and 
interpreted using different computational tools. The core 
of any ontology is a controlled vocabulary that attempts to 
describe a unified definition for all terms and concepts in 
a particular subject area [9]. A good example is the Gene
Ontology (GO) that provides a controlled vocabulary 
describing the roles of genes and their products in various 
organisms [10]. 

At the heart of systems toxicology is the understanding 
of signaling pathways perturbed by biologically active 
substances and the identification of those that have the 
potential to cause adverse health effects in humans. This 
requires integrating OMICs data with in vitro and in vivo 
toxicological endpoints. The goal of systems toxicology is 
therefore to link disease susceptibility at the molecular 
level to environmental stress or toxicant effect at the 
clinical level. Despite advances in various aspects of 
toxicogenomics, semantic representation of toxicological 
data and endpoints is still in its infancy. A variety of tools, 
platforms, and workflows coexist but each uses its own set 
of terms and ontologies, a challenge for data exchange. 
In their review, Hardy et al. [11] provide an overview of
existing toxicology vocabularies and ontologies that are
currently being used in predictive toxicology initiatives
and applications.

Recently, the toxicology OpenTox ontology has been 
developed to support standard representation of 
relations between chemical and toxicological datasets 
and experiments by unified terms. It is part of the 
OpenTox framework, which aims at unifying access to 
toxicity data, predictive networks, and validation procedures 
[12]. One of the advantages of the OpenTox ontology is the 
combination of several related ontologies that cover 
common information for chemical compounds, chemical 
datasets, algorithms, models, assays, in vivo studies, and 
toxicological endpoints. Moreover, when integrated in a 
semantic environment, the OpenTox ontology service 
facilitates registering new resources, remote access, and 
searching datasets using SPARQL. However, OpenTox
remains a high-level ontology and lacks concept
granularity for the majority of its components, in particular
for the domain of environmental exposure.

Lately, the exposure ontology (ExO) has been proposed 
to provide the missing link between exposure science and 
various environmental health disciplines, including 
toxicology [13]. The main advantage of the ExO is that it 
provides the first semantic template for representation of 
exposure information around the following four root 
concepts: exposure stressor, exposure receptor, exposure 
event, and exposure outcome. Although the current 
version of the ExO includes very general and high-level 
concepts to cover the breadth of the exposure knowledge 
domain, it still lacks sufficient granularity that is required 
to capture detailed information. Besides, the ExO is not
compliant with the proposed upper-level ontology
standards such as the Basic Formal Ontology (BFO) [14]
or the Descriptive Ontology for Linguistic and Cognitive
Engineering (DOLCE) [15], which makes its integration
with existing or new ontologies semantically more
difficult. Furthermore, Thomas et al. [16] describe the
use of a Smoking Behavior Risk Ontology (SBRO) to
represent risk models for phenotypes associated with
tobacco smoking behavior. However, the scope of
their ontology is limited to nicotine pharmacokinetics,
pharmacodynamics, nicotine dependence, and clinical
smoking cessation outcomes.

Exposure to tobacco smoke is considered an environmental
risk factor to human health and is involved
in the initiation and progression of several respiratory
diseases, including chronic obstructive pulmonary disease
(COPD) and lung cancer [17,18]. Elimination or
minimization of exposure to cigarette smoke provides a 
clear opportunity to prevent related diseases. Although 
experiments that measure exposure to environmental 
tobacco smoke follow - to a large extent - the typical 
protocols used in toxicology experimental settings, no 
semantic framework capturing information specific to the 
domain of cigarette smoke exposure risk is available. 

In response to the need for semantic representation of 
the environmental exposure knowledge domain with 
particular focus on the cigarette smoke exposure risk, 
the Cigarette Smoke Exposure Ontology (CSEO) was 
developed. 

Results 

Purpose of the cigarette smoke exposure ontology 

The development of an ontology starts by defining its 
domain and scope. The scope of the CSEO was defined 
based on the potential application of the ontology in the 
domain of environmental exposure and was focused on 
exposure to cigarette smoke. Since setting a proper 
scope helps draw boundaries to the knowledge domain 
included in the ontology, the CSEO is intended to 
include all concepts and terms that represent processes 
and elements involved in conducting cigarette smoke 
exposure experiments, in association with cigarette-smoke 
related diseases (Figure 1). 

The scope of the ontology revolves around the 'exposure
experiment' concept and covers description of sampling
and experimental factors, test items, test systems, exposure 
condition, and link to diseases. These are the main 
concepts to be included in the CSEO by following the life 
cycle of ontology building, as described in the Methods 
section. Axiomatisation of concepts in the CSEO is based 
on the axioms provided in the BFO and ExO. For example, 
the description of an exposure follows the lines of the 
"exposure event" class in the ExO. We have, furthermore, 
enriched the ExO classes with extra classes that make the 
ontology more specific to cigarette smoke rather than just 
to exposures in general. The reason for choosing these
concepts is that they represent the major players in systems
toxicology studies conducted in the domain of smoke
exposure.

Figure 1 High-level schematic representation of the CSEO scope. The scope of the CSEO was designed around the key concept of exposure
experiment and its substantial elements.

Most exposure experiments follow a similar
routine, summarized as follows: the design, factors, and
protocols of an experiment must be defined before
conducting the experiment. This is often the case for
exploratory systems-based approaches and less so
for validated assays. The two main components of an
experiment are often a test system and test item, where the 
test system describes the exposure receptor (e.g., a clinical, 
in vivo, or in vitro setup), and the test item describes the 
exposure stressor (e.g., chemical compounds, cigarette 
smoke, and its characterization). Both of these components 
require terms that clearly specify the items. These two 
components interact in an exposure experiment and their 
interaction is described by the exposure conditions, for 
example, exposure transport path, frequency, and doses. 
The exposure condition, therefore, connects the test 
system and the test items under the experiment description. 
The exposed test system itself includes sampling procedures,
which are bound to various endpoint measurements.
In the case of systems-based approaches, the sampling 
procedures cover a large number of procedures. The 
sampling of the test items together with the endpoint 
measurements leads to an outcome, which may be 
associated with respiratory system diseases. 
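
The routine described above can be pictured as a small data model. The following is a minimal sketch only, not the CSEO schema: the class and field names are illustrative assumptions, and the example values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExposureReceptor:
    """The test system, e.g. a clinical, in vivo, or in vitro setup."""
    name: str
    setup: str


@dataclass
class ExposureStressor:
    """The test item, e.g. cigarette smoke or a chemical compound."""
    name: str


@dataclass
class ExposureCondition:
    """Connects test system and test item within one experiment."""
    transport_path: str
    frequency: str
    dose: str


@dataclass
class ExposureExperiment:
    design: str
    test_system: ExposureReceptor
    test_item: ExposureStressor
    condition: ExposureCondition
    endpoint_measurements: List[str] = field(default_factory=list)
    outcome: str = ""


def describe(exp: ExposureExperiment) -> str:
    """One-line summary: stressor -> receptor via transport path."""
    return (f"{exp.test_item.name} -> {exp.test_system.name} "
            f"via {exp.condition.transport_path}")


# Hypothetical example values, for illustration only.
example = ExposureExperiment(
    design="exploratory systems-based study",
    test_system=ExposureReceptor(name="mouse model", setup="in vivo"),
    test_item=ExposureStressor(name="cigarette smoke"),
    condition=ExposureCondition("inhalation", "daily", "unspecified"),
    endpoint_measurements=["transcriptomics"],
    outcome="respiratory system disease",
)
print(describe(example))
```

The exposure condition is kept as its own object because, as described above, it is the element that links the test system to the test item under one experiment description.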

The main purpose of the ontology is to support annotation
of experimental data sets, such as the details of the
experiment and its design, description of test item, test
system, as well as the exposure path to outcomes.
Additional file 1 shows an example of the use of the CSEO to
annotate experiments. GeneChip Microarray experiments
generate high-throughput transcriptomic data
that can be reused for research topics other than the
originally designed experiment. Therefore, the FGED
(Functional Genomics Data) society created standards
to exchange these and other similar data types related to 
functional genomics. These standards not only include the 
format of exchange but also the minimum requirements 
for experimental annotation so that experimental data can 
be correctly reproduced and reused. The exchange file 
format is called MAGE-TAB [19], which includes an 
IDF file for the definition of the investigation, an SDRF
file for the specification of each sample, and an ADF
file for the specification of the microarray analyte layout. 
This file format is supported by the repository ArrayExpress 
[20] and gives open access to a large number of functional 
genomics datasets. 

While MAGE-TAB defines the exchange format,
another standard, MIAME [21] (the Minimum Information About
a Microarray Experiment), describes the required annotation
level. Additional file 1 shows an
example of the SDRF file that is MAGE-TAB and 
MIAME compliant. Each row indicates the biological 
samples with annotations and protocols for biological 
sample transformation. The data model starts with a 
subject, which is an animal model including additional 
information about type, strain, and gender. When a protocol 
applies, the biomaterial is changed, here from an untreated 
animal to a treated animal. The treatment is further 
described with the exposure item, brand, smoking regimen, 
nicotine concentration, exposure path, and exposure 
duration. The next protocol defines a post-exposure 
treatment and affects only part of the samples. After 
all exposures, the animal is dissected into organ parts 
that are described by the next protocol. The organ 
part is now further defined as frozen alveolar tissue area 
from left lung of each animal. The next protocols define 
lysis in this tissue and the extraction of RNA that is
hybridized on a GeneChip. The SDRF file ends with the 
reference to the raw data file names, processed data file 
name, and a summary of all experimental factor values. 
All protocols are defined in the IDF file (not shown). 
MAGE-TAB requires the use of ontology-defined
terms. The ontology resource is specified with location 
and version in the IDF. Yellow marked columns in 
Additional file 1 show the CSEO annotations that 
cover a large fraction of the SDRF file and ensure rich 
and proper annotation. The annotation level of this file is 
much richer than the MIAME requirement and supports 
the reproducibility and reusability of experimental data. 
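
To make the SDRF annotation idea concrete, the following minimal sketch reads a tab-delimited SDRF-like table and lists the columns whose terms are attributed to the CSEO through the adjacent "Term Source REF" column. The table content, sample names, and protocol names are hypothetical and much shorter than Additional file 1; only the general tab-delimited layout and the "Term Source REF" convention are taken from MAGE-TAB.

```python
import csv
import io

# Hypothetical, heavily shortened SDRF-like table (tab-delimited).
SDRF_TEXT = (
    "Source Name\tCharacteristics[organism]\tTerm Source REF\t"
    "Protocol REF\tFactorValue[exposure duration]\n"
    "animal_01\tMus musculus\tCSEO\tP-EXPOSURE\t2 months\n"
    "animal_02\tMus musculus\tCSEO\tP-SHAM\t2 months\n"
)


def read_sdrf(text):
    """Return (header, rows) from tab-delimited SDRF text.

    csv.reader is used rather than csv.DictReader because SDRF files
    may repeat column names such as 'Protocol REF'.
    """
    reader = csv.reader(io.StringIO(text), delimiter="\t")
    header = next(reader)
    rows = [row for row in reader if row]
    return header, rows


def annotated_columns(header, rows, source="CSEO"):
    """Columns whose following 'Term Source REF' cites `source` in any row."""
    annotated = []
    for i, name in enumerate(header):
        if name == "Term Source REF" and i > 0:
            if any(row[i] == source for row in rows):
                annotated.append(header[i - 1])
    return annotated


header, rows = read_sdrf(SDRF_TEXT)
print(annotated_columns(header, rows))
```

A check of this kind could be used to estimate which fraction of an SDRF file is covered by ontology terms, which is what the yellow-marked columns in Additional file 1 convey visually.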

Furthermore, conceptualizing and organizing this 
knowledge domain in the form of an ontology allows 
efficient augmentation of biological knowledge retrieval 
and extraction. Therefore, the sensitivity to which 
biological mechanisms are modulated in response to 
different risk factors posed by smoking toxicants in 
the lungs can be captured. 

Framework and architecture of the CSEO 

The CSEO was designed to be compliant with the Basic 
Formal Ontology (BFO). The BFO was adopted to define 
the upper-level standard architecture. The BFO is designed 
to support development of domain ontologies for scientific 
research [22]. On the other hand, the ExO is the only 
existing and intuitive semantic framework used by the 
exposure science community that provides a good 
template for plugging in subdomain ontologies related 
to the exposure domain. Therefore, the ExO superclasses
were used as root concepts for the CSEO. Accordingly,
the CSEO populates the ExO for the concepts of the
cigarette smoke risk subdomain and also complies
with requirements of the OBO Foundry and RO
(Relation Ontology). Figure 2 depicts the architecture
of the CSEO in relation to BFO and ExO and its main 
classes. Such an architecture is expected to incorporate 
provenance into the CSEO so that concepts can be traced 
back to their corresponding upper-level classes in ExO 
and BFO. 

The CSEO comes in two different versions: the main 
CSEO version is a BFO-compliant ontology, and the 
second version is a controlled vocabulary version, 
hereafter referred to as "lexical version". The CSEO-BFO 
version consists of the BFO top-level hierarchy into 
which the adjusted ExO hierarchy was plugged. The 
CSEO classes were organized underneath these layers as 
a third layer of granularity. This is the so-called 
"computer-readable" format of the CSEO, which 
represents the formal ontology. The lexical version, on 
the other hand, forms the so-called "expert-readable" 
format and does not claim to be a standard-adhering 
ontology in itself. Instead, it is an access point to the 
CSEO classes that is intuitive and easy to navigate for 
medical and biological experts. This lexical version 
supports the creation and review of the ontology by 
various experts within the field. It, furthermore, creates
a categorization of ontology classes and terms into
'context categories' inside the knowledge domain. This is
also usable for context-sensitive text mining: for example, it
contains a branch that collects all terms related to
exposure outcomes (including terms which are not
necessarily exposure types), whereas in the CSEO-BFO
version they have to be collected manually. Both
versions are available on the dedicated CSEO wiki website.



Figure 2 Schematic representation of the main ontology classes and class provenance between BFO, ExO, and CSEO. Arrow legend:
blue: is-a relations; yellow: ExO: is_associated_with; orange: ExO: interacts_with_an_exposure_stressor_via; brown: MGED: has_experiment_design;
violet: ExO: interacts_with.
Three-dimensional evaluation of the CSEO
Structural measure 

Measurement of the structural dimension of the ontology 
reflects the organizational patterns of the concepts in 
the ontology. The first draft of CSEO (version 1.0) is 
composed of 20091 concepts, including the BFO and 
ExO classes. Additional file 2 provides several metrics 
on structural properties of the ontology. These metrics
include 'breadth', which relates to the cardinality of paths;
'depth', which relates to the cardinality of paths in a
graph; 'tangledness', which relates to multi-hierarchical
nodes; and 'fanout factor', which relates to the dispersion
of nodes.

As shown in Additional file 2, the high number of 
classes and leaves together with high values for average 
width and the fanout factor, point towards a broad 
coverage of concepts by the ontology whereas the values 
for depth show specificity of the concept types to the 
domain of cigarette smoke exposure risk. The tangledness 
factor of 0.71 indicates the presence of multi-hierarchical 
nodes in the ontology (i.e. categories having multiple 
parents). This is beneficial when greater crosslinking of 
the domain concepts is desired. Different relation types 
from RO were used to relate concepts in the CSEO,
including 'part_of', 'precedes', 'has_participant', etc. Figure 2
illustrates the relational view of the second-level concepts 
in the CSEO. 
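
For readers unfamiliar with structural metrics, the sketch below computes simplified versions of depth, fanout factor, and tangledness on a toy is-a hierarchy. The class names are hypothetical and the metric definitions are simplifying assumptions made for illustration; they are not necessarily the definitions used to produce the values in Additional file 2.

```python
from collections import defaultdict

# Toy is-a hierarchy: class -> list of direct parents (hypothetical names).
PARENTS = {
    "exposure stressor": [],
    "smoke": ["exposure stressor"],
    "chemical compound": ["exposure stressor"],
    "cigarette smoke": ["smoke"],
    "smoke constituent": ["smoke", "chemical compound"],  # two parents
}


def children_of(parents):
    """Invert the child -> parents map into parent -> children."""
    kids = defaultdict(list)
    for child, direct_parents in parents.items():
        for parent in direct_parents:
            kids[parent].append(child)
    return kids


def max_depth(parents):
    """Longest is-a path, counted in nodes, from a root to any class."""
    memo = {}

    def depth(node):
        if node not in memo:
            ps = parents[node]
            memo[node] = 1 + (max(depth(p) for p in ps) if ps else 0)
        return memo[node]

    return max(depth(node) for node in parents)


def fanout_factor(parents):
    """Assumed definition: mean number of direct subclasses per non-leaf class."""
    kids = children_of(parents)
    return sum(len(v) for v in kids.values()) / len(kids)


def tangledness(parents):
    """Assumed definition: share of classes with more than one direct parent."""
    multi = sum(1 for ps in parents.values() if len(ps) > 1)
    return multi / len(parents)


print(max_depth(PARENTS), fanout_factor(PARENTS), tangledness(PARENTS))
```

In this toy hierarchy only "smoke constituent" has two parents, which is the kind of multi-hierarchical node that the tangledness value reflects.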

Functional measure 

Measuring the functional dimension of the ontology 
indicates how well the conceptualization of the ontology 
captures the semantic space of the knowledge domain. The 
lexicalized ontology was used to calculate precision, recall,
and F-score values (69.23%, 77.81%, and 73.26%, respectively).

The result of this evaluation shows that the ontology 
in its current form is able to capture a wide range of 
concepts related to cigarette smoke exposure in the 
knowledge domain of exposure with a reasonable sensitivity 
and specificity towards manual curation. The F-score of 
above 73% reflects the quality output of the ontological 
search in the published knowledge domain of cigarette 
smoke exposure risk. 
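
As a quick consistency check, the reported values can be related through the balanced F-score, i.e. the harmonic mean of precision and recall. The sketch below assumes this balanced form was used; the text does not state the exact formula.

```python
def f_score(precision, recall):
    """Balanced F-score: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


reported_precision = 69.23  # percent, as reported in the text
reported_recall = 77.81     # percent, as reported in the text

print(round(f_score(reported_precision, reported_recall), 2))
```

Recomputing from the rounded precision and recall gives about 73.27, which agrees with the reported F-score of 73.26 up to rounding of the inputs.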

Usability profile 

The usability profile of an ontology is defined by the extent of
user-friendliness of the ontology in terms of easy navigation, 
knowledge accessibility, and meta-information availability. 
Navigation of the CSEO and its user interface has been 
facilitated using the WebProtege software, which provides a 
web-based access to the content of the ontology without 
the need for software installation [23]. By following the 
hyperlink provided on the wiki website under "CSEO
access", the user is directed to the WebProtege page, in
which clicking CSEO launches the formal BFO-compliant
ontology, whereas clicking the CSEO-Expert Readable hyperlink
launches the hierarchy of controlled vocabulary underlying
CSEO. The search field makes it possible to search for any
CSEO-related concept and locate it in the tree (Figure 3).
Feedback can be provided through the same portal, and a
dedicated team will process it.

To increase the level of efficiency in accessing different
views (subdomains) of the ontology, the ExO root concepts
were used for further classification of the CSEO instances.
By this means, tracking exposure-specific concepts
becomes easier and more efficient for users. Meta-information
(i.e. annotations including synonyms, definition, and
reference) is provided for each concept in the CSEO
to enable users to access relevant information.

Since proper documentation is needed to ensure
direct access and efficient usability of the ontology, a
wiki environment was created that contains instructions for
using the ontology, documentation on the purpose and scope of
the ontology, and information about interfacing to the
ontology. The wiki is accessible through the following
hyperlink in Firefox and Safari browsers:
https://publicwiki-01.fraunhofer.de/CSEO-Wiki/index.php/Main_Page.

Use-case scenario: answering competency questions by 
experts 

Ontology-driven information retrieval and extraction 
systems will guide analysis of literature in precisely 
answering complex scientific questions [24]. The lexicalized 
form of the CSEO was used to automatically retrieve 
and extract domain specific knowledge related to 
cigarette smoke exposure risk from PubMed abstracts 
(see Methods). Experts in the knowledge domain of 
cigarette smoke exposure risk were asked to design 
several complex questions to be posed to the ontology. 
The following questions were considered to test the 
performance of the ontology: 

- What are the potential effects of the toxicity 
induced by tobacco smoke constituents on smokers? 

- Which toxicological studies are available that 
measure total particulate matter in electrically 
heated cigarettes? 

- Which documents report on the use of experimental 
mouse models for investigating the effect of 
cigarette smoke exposure on the risk of COPD? 

Queries were formulated in the SCAIView environment 
using the CSEO terminology. SCAIView displays named 
entities by markup of the text (e.g. PubMed abstracts). 
The key feature of SCAIView is the possibility to perform 
ontological search in biomedical text using concept 
hierarchies and synonyms associated with each concept in 
the ontology. While using the ontology in SCAIView, the
hierarchical organization of the ontology was preserved by
transforming the ontology OWL file into an XML tree
structure. Subsequently, retrieved documents were
manually checked for containing correct answers to
the posed competency questions. Table 1 summarizes these
queries, their corresponding retrieval rate, and reference to
the relevant documents that contain correct answers
to competency questions. Titles of both relevant and
irrelevant abstracts are listed in Additional file 3.

Figure 3 Illustration of term search and navigation through the CSEO.
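
The principle of ontological search with concept hierarchies and synonyms can be illustrated with a short sketch. This is not SCAIView or ProMiner code: the vocabulary entries are hypothetical, and plain substring matching stands in for proper named entity recognition.

```python
# Hypothetical mini-vocabulary, for illustration only.
SYNONYMS = {
    "cigarette smoke": ["tobacco smoke"],
    "mainstream smoke": ["mainstream cigarette smoke"],
    "sidestream smoke": [],
}
NARROWER = {
    "cigarette smoke": ["mainstream smoke", "sidestream smoke"],
}


def expand(concept):
    """Return the concept, its synonyms, and those of all narrower concepts."""
    terms = []
    seen = set()
    stack = [concept]
    while stack:
        current = stack.pop()
        if current in seen:
            continue
        seen.add(current)
        terms.append(current)
        terms.extend(SYNONYMS.get(current, []))
        stack.extend(NARROWER.get(current, []))
    return terms


def matches(concept, text):
    """True if any expanded term occurs in the text (naive substring match)."""
    lowered = text.lower()
    return any(term.lower() in lowered for term in expand(concept))


abstract = "Mice were exposed to sidestream smoke for four weeks."
print(matches("cigarette smoke", abstract))
```

A query for the broader concept thus also retrieves text that mentions only a narrower concept or a synonym, which is why preserving the hierarchy matters for retrieval.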

These results indicate that application of the CSEO- 
derived terminology to the semantic literature search 
leads to retrieval of highly relevant publications containing 
the correct answer to the posed competency question. 
Moreover, highlighted CSEO concepts (terms) by 
SCAIView allow users to detect and extract knowledge 
statements, as illustrated in Figure 4. The CSEO termin- 
ology can be accessed through the SCAIView search 
engine under: www.scaiview.com/scaiview-academia.html. 
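
The relevance rates in Table 1 follow directly from the retrieved and relevant document counts. The short sketch below recomputes them; the query labels are abbreviated for readability and the function name is illustrative.

```python
# (retrieved, relevant) document counts per query, as reported in Table 1.
TABLE_1 = {
    "smoke constituent / toxicity / tobacco": (21, 17),
    "electrically heated cigarette / total particulate matter": (7, 7),
    "mouse model / cigarette smoke exposure / COPD": (9, 9),
}


def relevance_rate(retrieved, relevant):
    """Percentage of retrieved documents judged relevant."""
    return 100.0 * relevant / retrieved


for label, (retrieved, relevant) in TABLE_1.items():
    print(f"{label}: {relevance_rate(retrieved, relevant):.2f}%")
```

This rate corresponds to the precision of each individual query; it says nothing about relevant documents that the query did not retrieve.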

Discussion 

The CSEO covers relevant concepts in the field of
systems-based toxicology assessment and includes many
terms from conventional toxicology assessment. Thus,
the CSEO enables users to capture and integrate exposure
information from the beginning of the experiment to the
point of outcome measurement. Compared to other
relevant ontologies, the CSEO covers a large number
of concept classes, including classes re-used from 44 external ontologies.
Additionally, the CSEO uses semi-automated methods for 
the term extraction and evaluation and therefore ensures 
good coverage of the knowledge domain. 

Another advantage of the CSEO over the existing 
related ontologies is the enrichment of high-resolution 
concepts that extends the higher-level exposure ontology 
in areas where existing ontologies are particularly weak. 
For instance, the CSEO describes mouse and rat strains 
that are commonly used in exposure experiments, 
includes human anatomy with a dedicated subclass to 
microanatomy of the respiratory system, and articulates 
staging of progressive diseases. Moreover, the CSEO can 
be used for text mining and knowledge discovery purposes 
because the CSEO is a lexicalized ontology that supports 
ontology-driven information retrieval and extraction as
described in the application scenario. Finally, the ability to
use the CSEO in different systems may be facilitated
by the BFO upper-level ontology. Thus, various sub-ontologies
relevant to exposure can be integrated with
the ExO-CSEO structure under the BFO framework.

Table 1 Answering competency questions using CSEO-driven semantic search in PubMed abstracts

Query (22.03.2013): (([CSEO: "Smoke Constituent"]) AND [CSEO: "Toxicity"]) AND [CSEO: "Tobacco"]
No. of retrieved docs: 21
No. of relevant docs: 17 (80.95%)
PMIDs of relevant documents: 14521141 [25], 1188959 [26], 18848577 [27], 21651432 [28],
17661226 [29], 2002748 [30], 12857635 [31], 19330121 [32], 14698566 [33], 11731039 [34],
18383128 [35], 16859820 [36], 21651433 [37], 21417965 [38], 21651431 [39], 15072838 [40],
18464053 [41]

Query (22.03.2013): ([CSEO: "Electrically heated cigarette"]) AND [CSEO: "Total Particulate Matter"]
No. of retrieved docs: 7
No. of relevant docs: 7 (100%)
PMIDs of relevant documents: 12975773 [42], 12975774 [43], 14698566 [33], 12975771 [44],
18590791 [45], 12975772 [46], 16963170 [47]

Query (22.03.2013): (([CSEO: "Mouse model"]) AND [CSEO: "Cigarette Smoke Exposure"]) AND [MeSH Disease: "Pulmonary Disease Chronic Obstructive"]
No. of retrieved docs: 9
No. of relevant docs: 9 (100%)
PMIDs of relevant documents: 20133926 [48], 19017996 [49], 23044435 [50], 22279084 [51],
18988919 [52], 21700603 [53], 20228194 [54], 19491340 [55], 16510458 [56]

[Figure 4 image: SCAIView view of PubMed abstract 14698566, "Chemical composition, cytotoxicity and mutagenicity of smoke from US commercial and
reference cigarettes smoked under two sets of machine smoking conditions" (Roemer et al., Toxicology, 2004), with CSEO terms highlighted.]

Figure 4 An example of highlighted CSEO terms in PubMed abstracts as they appear in the SCAIView environment. The highlighted
terms guide users to informative statements and facilitate their detection, quality check and extraction.

Similar to other ontologies, the CSEO suffers from
sparse granularity and misclassification of concepts in
some parts of the ontology. Other shortcomings common
to all ontologies such as missing concepts, lack of standard 
definitions, and incompleteness of synonym lists should be 
addressed by engagement of the research community and 
inclusion of their feedback in the process of ontology 
enrichment. To facilitate the community contribution, a 
website has been prepared with the aim of collecting users' 
feedback and providing access to the latest version of the 
ontology. With the public release of the ontology, it is 
hoped to reach out to the broader community and collect 
feedback and comments, which will be integrated in the 
future versions of the CSEO and be used to improve the 
ontology. With version 1.0 of the CSEO, the ontology
is sufficiently established to be useful for the scientific
community. Furthermore, the project team will continue
to review articles, abstracts, and other resources relevant
to the domain and to extract novel terms and synonyms.
New releases of the CSEO will be announced and made
available through the NCBO's BioPortal.

Conclusions 

The CSEO, which includes relevant terms
for describing exposure experiments, can serve as a
powerful glossary for finding definitions and visualizing
relationships, facilitating the correct use of terms. The
CSEO has the potential to grow in the future and be used
as a dictionary for various processes such as controlling
internal documents (e.g. Excel Workbooks) or efficient use
of Laboratory Information Management Systems (LIMS).
This functionality can be used for the identification of
relevant information (internally or publicly) or for the
extraction of relevant knowledge statements.

Methods 

Defining scope of the CSEO 

To define the scope, a qualitative survey was performed 
involving various experts in the domain of environmental 
exposure. Experts in toxicology, molecular biology, and 
clinical pathology fields at PMI were consulted and asked
for their input on the concept classes that they deem
necessary to describe the knowledge domain of
environmental exposure from their viewpoint. Based on
this input, the boundaries of the knowledge domain to be
presented by the CSEO were determined, as depicted in Figure 1.

Resources and tools 

Different resources were used for construction of the 
ontology (Additional file 4). General and common 
concepts, for which an established ontological definition 
exists, were captured. The 44 publicly available ontologies
listed in Additional file 4 were re-used and the relevant
terms/classes/concepts were selectively integrated in the
CSEO along with their annotations. Specialized terms,
mainly used for internal process and workflow tracking
in systems such as Laboratory Information Management
Systems (LIMS), were collected from various contributors. Literature
sources either were searched by keywords (e.g. smoke, toxicity,
cigarette, tobacco in PubMed) or were recommended
by experts (e.g. CORESTA publications or handbooks).
Additionally, relevant publicly available abstracts, a number 
of relevant full-text articles, as well as "The Handbook of 






Cigarette Smoke Toxicity" by David Bernhard were 
reviewed. Here, relevant text bodies were manually 
annotated, and relevant terms were extracted, enriched 
with synonyms, and integrated into the ontology. 

Protégé 4.2 (Build 276) [57], developed and 
maintained by the National Center for Biomedical 
Ontology, was used together with its built-in HermiT 1.3.3 
reasoner [58] to construct the ontology. The Knowtator 
plugin [59] was used for manual annotation of abstracts 
inside the Protégé environment. The text-mining tool 
ProMiner [60] was used for named entity recognition 
of ontology terms in PubMed abstracts, and the results 
were integrated with SCAIView [61] for context-sensitive 
visualization of query results. 



Ontology development and evaluation process 

During the process of ontology building, a hybrid 
approach combining both bottom-up and top-down 
methods was adopted so that the ontology was 
populated at the level of superclasses and subclasses 
simultaneously. The development of the CSEO was 
accomplished in four phases according to the common 
life cycle of ontology building [62]. 

Phase I: Knowledge acquisition and conceptualization 

Concepts were extracted from previously identified 
resources (see Additional file 4). Resources were 
classified into two groups based on their contents: 
structured content and unstructured content. Concepts from 
structured content such as tables, ontologies, and lists 
were integrated automatically, whereas concepts from 
unstructured content such as the free text of publications 
were manually inspected and extracted with the help of 
annotation tools. Figure 5 describes the cardinal 
mapping of resources to the ontology contents. All concepts 
in the ontology were annotated with additional 
information including synonym(s), definition(s), and 
reference(s). In the BFO version of the CSEO, relationships 
among concepts were defined based on the standard 
relation types in the Relation Ontology (RO) [63] and were 
checked using the HermiT reasoner. 



Resource categories shown in Figure 5: Book; Literature; Encyclopedias and Online Resources. 

Resource | CSEO branch 
Book: Cigarette Smoke Toxicity by David Bernhard | Exposure Stressor -> Tobacco product 
Literature: An Overview by EU commission: Tobacco, Cigarettes and Cigarette Smoke | Various parts of the ontology 
500 published abstracts from CORESTA, MRB, SOT, TSRC, Selventa, and Vitrocell | Exposure Event -> Spatial quality 
Provantis | Exposure Outcome -> Clinical activity 
Labvantage | Exposure Receptor -> Organism 
NCI Thesaurus | Exposure Receptor -> Anatomic Structure, System, or Substance 
C-DISC | Exposure Outcome -> Biological response 
ArrayExpress | Experiment -> Experimental factor -> Measurement 
Measurement Units Ontology | Experiment -> Experimental factor -> Unit 
MGED Ontology | Experiment -> Experimental factor -> Protocol 
Experimental Factor Ontology | Experiment -> Experimental factor -> Protocol 
NCBITaxon | Exposure Outcome -> Finding 
OpenToxipedia | Various parts of the ontology 
ExO | Exposure event/outcome/receptor/stressor 
BFO | Entity -> Continuant/Occurrent 

Figure 5 Mapping resources used for generating the ontology contents to their corresponding branches in the CSEO. 

Phase II: Terminology analysis and concept enrichment 

Transformation of the ontology OWL format into a 
dictionary file was achieved using a Java script. The 
script extracts concept names and the corresponding 
synonyms from the ontology OWL structure and assigns 
unique identifiers to each concept. This dictionary 
was incorporated into ProMiner for named entity 
recognition. In a subsequent step, the major 
superclass concepts were used as keywords for queries in 
PubMed. Five hundred relevant abstracts were chosen 
from the result list of each concept search. After 
compiling all abstracts, the corpus was randomly 
divided into a training set (250 abstracts) and test set 
(250 abstracts) using the randomization command in 
Linux. To create the reference gold standard, suitable 
annotation guidelines were developed so that 
annotators were guided to keep the breadth and depth of the 
ontology in mind. For enrichment purposes (here, 
optimizing both the ontology concepts and the 
corresponding dictionary), the training set was analyzed for 
false-negative entities, which, after individual expert 
evaluation, were added to the ontology. Classes were 
annotated both manually and automatically by mapping them 
to external ontologies. For this purpose, the services of the 
National Center for Biomedical Ontology (NCBO) were used [64]. CSEO 
classes were manually annotated with equivalent external 
ontology classes using an annotation property. These 
annotations were then used to automatically retrieve synonym 
information via the NCBO services. The evaluation process 
required a performance comparison between automatically 
and manually annotated text from the same set. 
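
The two mechanical steps of Phase II, dictionary generation and the random corpus split, can be illustrated with a minimal Python sketch. It is not the Java script or the Linux command used for the CSEO. It assumes an OWL file in RDF/XML serialization with concept names in rdfs:label. The synonym property (oboInOwl hasExactSynonym), the identifier pattern (CSEO_0000001), the function names, and the two example classes are assumptions made for illustration, and the returned mapping is not ProMiner's actual dictionary format. 

```python
import random
import xml.etree.ElementTree as ET

OWL_CLASS = "{http://www.w3.org/2002/07/owl#}Class"
RDFS_LABEL = "{http://www.w3.org/2000/01/rdf-schema#}label"
# Assumption: synonyms are stored under this annotation property.
SYNONYM = "{http://www.geneontology.org/formats/oboInOwl#}hasExactSynonym"

# Hypothetical two-class ontology fragment used only for the demonstration.
EXAMPLE_OWL = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#"
    xmlns:oboInOwl="http://www.geneontology.org/formats/oboInOwl#">
  <owl:Class rdf:about="http://example.org/cseo#TobaccoProduct">
    <rdfs:label>tobacco product</rdfs:label>
    <oboInOwl:hasExactSynonym>tobacco article</oboInOwl:hasExactSynonym>
  </owl:Class>
  <owl:Class rdf:about="http://example.org/cseo#Cigarette">
    <rdfs:label>cigarette</rdfs:label>
  </owl:Class>
</rdf:RDF>"""


def build_dictionary(owl_xml, prefix="CSEO"):
    """Map a generated identifier to [concept name, synonym, ...] per class."""
    root = ET.fromstring(owl_xml)
    dictionary = {}
    for cls in root.iter(OWL_CLASS):
        label = cls.find(RDFS_LABEL)
        if label is None or not (label.text or "").strip():
            continue  # skip anonymous or unlabeled class expressions
        names = [label.text.strip()]
        for synonym in cls.findall(SYNONYM):
            if synonym.text and synonym.text.strip():
                names.append(synonym.text.strip())
        # Identifiers are numbered in document order (hypothetical scheme).
        dictionary[f"{prefix}_{len(dictionary) + 1:07d}"] = names
    return dictionary


def split_corpus(abstract_ids, seed=0):
    """Shuffle a copy of the corpus and cut it into two halves."""
    shuffled = list(abstract_ids)
    random.Random(seed).shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


if __name__ == "__main__":
    for identifier, names in build_dictionary(EXAMPLE_OWL).items():
        print(identifier, "|".join(names))
    training_set, test_set = split_corpus(range(500), seed=42)
    print(len(training_set), len(test_set))
```

Fixing the seed makes the split reproducible, which is useful when the same training and test sets have to be regenerated later. 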

Phase III: Evaluation 

After completion of the ontology, a metric-based approach 
was used to evaluate it in three dimensions [65]. 
Structural evaluation was performed by 
calculating features such as depth, breadth, and other 
topological features. To evaluate the functional quality 
of the ontology, that is, how well it covers the boundaries of 
the knowledge domain it captures, precision, recall, 
and F-score values were calculated. Precision is the 
number of true positives (TP) divided by the sum of 
TP and false positives (FP). Recall is the number of 
TP divided by the number of results that should 
have been returned, i.e. TP plus false negatives 
(FN). The F-score is 2 × (precision × recall)/ 
(precision + recall). These values were derived from 
the longest string match found between words annotated 
automatically using ProMiner and the human-curated 
gold standard annotation for each abstract in the selected 
corpus [66]. 
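
The functional evaluation can be illustrated with a small Python sketch. It tags dictionary terms in a text by longest match and scores the resulting character spans against a gold standard using the precision, recall, and F-score definitions given above. ProMiner's matching is considerably more elaborate than this. The exact-span scoring rule, the function names, and the example sentence and terms are assumptions made for illustration only. 

```python
import re


def longest_match_annotate(text, terms):
    """Return the (start, end) spans of dictionary terms found in text.

    At each position the longest matching term wins, and the search
    resumes after the end of that match.
    """
    # Longest alternatives first, so the regex tries them before shorter ones.
    ordered = sorted(set(terms), key=len, reverse=True)
    if not ordered:
        return set()
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return {match.span() for match in pattern.finditer(text)}


def precision_recall_f(predicted, gold):
    """Score predicted spans against gold spans (exact span match assumed)."""
    tp = len(predicted & gold)   # spans found by both
    fp = len(predicted - gold)   # spans found only automatically
    fn = len(gold - predicted)   # gold spans that were missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


def span_of(text, phrase):
    """Helper for the demonstration: span of the first occurrence of phrase."""
    start = text.index(phrase)
    return (start, start + len(phrase))


if __name__ == "__main__":
    sentence = "Cigarette smoke exposure increases oxidative stress in lung tissue."
    dictionary_terms = ["cigarette", "cigarette smoke", "oxidative stress", "tissue"]
    gold_spans = {
        span_of(sentence, "Cigarette smoke"),
        span_of(sentence, "oxidative stress"),
        span_of(sentence, "lung tissue"),
    }
    predicted_spans = longest_match_annotate(sentence, dictionary_terms)
    print(precision_recall_f(predicted_spans, gold_spans))
```

In this example the dictionary lacks "lung tissue", so the tagger returns "tissue" alone. Under exact-span scoring this counts as one false positive and one false negative, which is the kind of case the training-set analysis in Phase II is meant to surface. 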

Phase IV: Visualization of concepts through the text 

The ontology was integrated into the SCAIView literature 
mining and visualization environment. 

Additional files 